The analysis is based on an unknown sample obtained from either mice or human tissues. After tissue dissociation, cells were sorted by FACS and scRNA-seq libraries were prepared with the Smart-Seq2 protocol/platform.

Load necessary libraries

library(tidyverse)
library(Seurat)
library(patchwork)
library(dplyr)
library(SingleR)
library(celldex) 
library(SummarizedExperiment) 
library(HGNChelper) #contains functions for identifying and correcting HGNC human gene symbols and MGI mouse gene symbols 
library(openxlsx)

Load count matrix

The starting point is a digital count matrix with mouse or human genes as features. The analysis will focus on the XRA.rds file.

load("./data/XRA.rds") # count matrix

A Seurat object is created to store both the count matrix and the following analysis steps (e.g. PCA).

XRA <- CreateSeuratObject(counts = counts, # give in input the count matrix 
                           project = "XRA", # name of the project
                           min.cells = 3,   # filter for genes (rows)
                           # keep cells that have genes expressed in at least 3 cells
                           min.features = 50 # filter for cells (columns)
                           # keep cells that have at least 50 features
                           )
XRA
## An object of class Seurat 
## 42740 features across 5043 samples within 1 assay 
## Active assay: RNA (42740 features, 0 variable features)
##  1 layer present: counts

The analysis is performed using Seurat v. 5.1.0. This version allows to store data in layers. In layer="counts" are stored the raw and un-normalized counts. With the following command, it is possible to check the data and try to access the count matrix.

LayerData(XRA, assay="RNA", layer="counts")[500:505, 1:30]
## 6 x 30 sparse Matrix of class "dgCMatrix"
##   [[ suppressing 30 column names 'GJOXFX', 'YRQBFE', 'RJUMSZ' ... ]]
##                                                                             
## CELA3A     .    .  . .   . . .  . .   . .   . .    .   . .   . . . . .   . .
## LINC01635  .    .  . .   . . .  . .   . .   . .    .   . .   . . . . .   . .
## LINC00339  .    .  . .   . . .  . .   . .   . .    .   . .   . . . . .   . .
## CDC42      . 1800 42 . 727 . . 11 1 410 . 609 . 1239 902 . 767 9 . . 8 219 2
## AL031281.2 .    .  . .   . . .  . .   . .   . .    .   . .   . . . . .   . .
## CDC42-IT1  .    .  . .   . . .  . .   . .   . .    .   . .   . . . . .   . .
##                                      
## CELA3A        . .   . .   .    .    .
## LINC01635     . .   . .   .    .    .
## LINC00339     . .   . .   .    .  303
## CDC42      1792 . 234 . 940 1692 1816
## AL031281.2    . .   . .   .    .    .
## CDC42-IT1     . .   . .   .    .    .

The name of the genes are reported in uppercase, so it is possible to assume that the unknown sample is obtained from human.

Q1 - Quality control and filtering

To obtain the cleaned matrix of data, the standard workflow of quality control and filtering was used.

Filtering based on the percentage of mitochondrial

The filtering based on the percentage of mitochondrial genes aims to filter out genes with increased number of genes that map to a mitochondrial genome. This is necessary because low-quality or dying cells often exhibit extensive mitochondrial contamination.

XRA[["percent_mt"]] <- PercentageFeatureSet(XRA, pattern = "^MT-")
max(XRA[["percent_mt"]])
## [1] 93.63296

Filtering based on the spike-in RNAs

As additional quality control, the filtering based on spike-in RNAs starting with “ERCC” is performed. (ERCC stands for External RNA Controls Consortium.)

XRA[["percent_ERCC"]] <- PercentageFeatureSet(XRA, pattern = "^ERCC")
max(XRA[["percent_ERCC"]])
## [1] 0.772809

Visualize QC metrics

It is possible to visualize the quality control metrics as a violin plot.

p1<- VlnPlot(XRA, # object in input
             features = c("nFeature_RNA", "nCount_RNA", "percent_mt", "percent_ERCC"), 
             # features from metadata that we want to show
             ncol = 4, # number of columns to show in the plot
             pt.size = 0.01) # dot size

p1

By looking at the violin plot showing the number of features, it is possible to see some outliers cells with high levels of number of features. This cells might be doublets or multiplets. There are also cells with a very low number of features. These might be dying cells.

By looking at the percentage of mitochondria, we see some cells that are outliers (with high levels of mitochondrial genes). So, in this case, we can discard these cells because they are dying cells, low quality cells or empty droplets.

To inspect the relationship between number of counts and percentage of mitochondrial genes and number of features, the function FeatureScatter() was used.

plot1 <- FeatureScatter(XRA, feature1 = "nCount_RNA", feature2 = "percent_mt")
# specify the two features we want to correlate

plot2 <- FeatureScatter(XRA, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")

plot1 + plot2

Based on the Violin plots, the filtering will be performed by setting the following thresholds:

XRA <- subset(XRA, 
                 subset =  nFeature_RNA < 3000 & 
                  percent_mt < 15 &
                  percent_ERCC < 0.1
                 )

Q2 - Normalization, identification of variable features, scaling (normal procedure or sctransform)

Normalization

After the filtering of the dataset, the workflow proceed with the normalization.

Log Normalization

By default, Seurat employs a global-scaling normalization method LogNormalize that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result (to reduce skewness). Normalized values are stored in the data layer (XRA[["RNA"]]$data).

XRA <- NormalizeData(XRA, 
                      normalization.method = "LogNormalize", 
                      scale.factor = 10000)
## Normalizing layer: counts
XRA[["RNA"]]$data[1:10,1:30] # inspect normalized counts
## 10 x 30 sparse Matrix of class "dgCMatrix"
##   [[ suppressing 30 column names 'RJUMSZ', 'JUYFQD', 'UONUAM' ... ]]
##                                                                                
## DDX11L1     . . . . .          . . . . . . . . .          . . . . . . . . . . .
## WASH7P      . . . . 0.03539945 . . . . . . . . 0.00563619 . . . . . . . . . . .
## MIR6859-1   . . . . .          . . . . . . . . .          . . . . . . . . . . .
## MIR1302-2HG . . . . .          . . . . . . . . .          . . . . . . . . . . .
## OR4F5       . . . . .          . . . . . . . . .          . . . . . . . . . . .
## AL627309.1  . . . . .          . . . . . . . . 0.01681416 . . . . . . . . . . .
## CICP27      . . . . .          . . . . . . . . .          . . . . . . . . . . .
## AL627309.6  . . . . .          . . . . . . . . .          . . . . . . . . . . .
## AL627309.7  . . . . .          . . . . . . . . .          . . . . . . . . . . .
## AL627309.5  . . . . .          . . . . . . . . .          . . . . . . . . . . .
##                               
## DDX11L1     .          . . . .
## WASH7P      .          . . . .
## MIR6859-1   .          . . . .
## MIR1302-2HG .          . . . .
## OR4F5       .          . . . .
## AL627309.1  .          . . . .
## CICP27      .          . . . .
## AL627309.6  .          . . . .
## AL627309.7  .          . . . .
## AL627309.5  0.01199446 . . . .

When this normalization method is applied, the assumption is that each cell should have the same number of reads, so anytime a different number of reads is observed, that difference is technical and not biological, and it needs to be corrected. However, this assumption is questionable, especially in single-cell analysis, because it is known that cells donʼt have the same number of transcripts (it depends on the cell type).

Identification of highly variable features (feature selection)

After the log normalization, the workflow continues with the selection of a subset of features (genes) that exhibit high cell-to-cell variation in the dataset (i.e, they are highly expressed in some cells, and lowly expressed in others). Focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets.

The procedure to select variable features is implemented in the FindVariableFeatures function (the procedure models the mean-variance relationship inherent in single-cell data). By default, the function returns the 2,000 most variable features per dataset. These will be used in downstream analysis, like PCA.

XRA <- FindVariableFeatures(XRA, 
                             selection.method = "vst", 
                             nfeatures = 2000)
## Finding variable features for layer counts
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(XRA), 10)

The top 10 variable features are:

top10
##  [1] "ACTA1"   "CKM"     "TPSB2"   "TPSAB1"  "CXCL8"   "TNNC2"   "PLA2G2A"
##  [8] "C7"      "MYL2"    "ENO3"
# plot variable features with labels
plot1 <- VariableFeaturePlot(XRA)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
plot2
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

In the plot reported above, each red dot is a highly variable feature.

Scaling the data

By scaling, Seurat applies a linear transformation to the expression levels of each gene, that is a standard pre-processing step prior to dimensional reduction techniques like PCA.

The ScaleData function:

  • Shifts the expression of each gene, so that the mean expression across cells is 0

  • Scales the expression of each gene, so that the variance across cells is 1 (z-score transformation)

This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate

The results of this are stored in XRA[["RNA"]]$scale.data.

all_genes <- rownames(XRA) # perform scaling on all genes (by default, only the top 2000 are scaled)

XRA <- ScaleData(XRA, 
                  features = all_genes) # specify which features to scale
## Centering and scaling data matrix
XRA[["RNA"]]$scale.data[1:5,1:10]
##                  RJUMSZ      JUYFQD      UONUAM      TSMWLG      TFSDGD
## DDX11L1     -0.03302852 -0.03302852 -0.03302852 -0.03302852 -0.03302852
## WASH7P      -0.07904836 -0.07904836 -0.07904836 -0.07904836  0.46208668
## MIR6859-1   -0.03125252 -0.03125252 -0.03125252 -0.03125252 -0.03125252
## MIR1302-2HG -0.01635612 -0.01635612 -0.01635612 -0.01635612 -0.01635612
## OR4F5       -0.01748152 -0.01748152 -0.01748152 -0.01748152 -0.01748152
##                  GBQWUU      LKVPRJ      HZQFDM      YPZTJL      ZMMETL
## DDX11L1     -0.03302852 -0.03302852 -0.03302852 -0.03302852 -0.03302852
## WASH7P      -0.07904836 -0.07904836 -0.07904836 -0.07904836 -0.07904836
## MIR6859-1   -0.03125252 -0.03125252 -0.03125252 -0.03125252 -0.03125252
## MIR1302-2HG -0.01635612 -0.01635612 -0.01635612 -0.01635612 -0.01635612
## OR4F5       -0.01748152 -0.01748152 -0.01748152 -0.01748152 -0.01748152

SCTransform

An alternative normalization method could be sctransform. This method was introduced in Hafemeister and Satija, Genome Biology 2019 and with its procedure omits the need for heuristic steps including pseudocount addition or log-transformation. This results in an improvement of the common downstream analytical tasks such as variable gene selection, dimensional reduction, and differential expression.

SCTransform is a single command that replaces NormalizeData(), ScaleData(), and FindVariableFeatures(). Transformed data are available in the SCT assay in the SEURAT object, which is set as the default after running sctransform.

In the following chunks is reported the alternative workflow. However, the SCTransform method will not be used.

XRA <- SCTransform(XRA, verbose = FALSE)
XRA[["SCT"]]@scale.data[1:5,1:10]

Q6 - Cell cycle analysis

The presence of cell cycle within scRNAseq data may introduce variability. If the experiment does not focus on the cell cycle, strategies can be employed to address this heterogeneity.

In Seurat, handling cell cycle effects involves computing phase scores using established markers and then removing them during data pre-processing to alleviate the impact of cell cycle heterogeneity on scRNA-seq data.

Load marker genes

A list of cell cycle markers for the S and G2/M phases, derived from Tirosh et al, 2015, is loaded with Seurat. This list can be segregated into markers of G2/M phase and markers of S phase. These two list of markers (denoting G2/M and S cell phase) will be used to find matches against the cell from XRA dataset.

s_genes <- cc.genes$s.genes # replication phase
g2m_genes <- cc.genes$g2m.genes # division phase

Assign Cell-Cycle Scores

First, each cell is assigned a score, based on its expression of G2/M and S phase markers. These marker sets should be anticorrelated in their expression levels, and cells expressing neither are likely not cycling and in G1 phase.

Scores are assigned in the CellCycleScoring function, which stores S and G2/M scores in object meta data, along with the predicted classification of each cell in either G2M, S or G1 phase. The results are stored in the meta data.

CellCycleScoring can also set the identity of the Seurat object to the cell-cycle phase by passing set.ident = TRUE (the original identities are stored as old.ident). Please note that Seurat does not use the discrete classifications (G2M/G1/S) in downstream cell cycle regression. Instead, it uses the quantitative scores for G2M and S phase. However, the predicted classifications are provided in case they are of interest.

cell_cycle <- CellCycleScoring(XRA, 
                           s.features = s_genes, # markers S phase
                           g2m.features = g2m_genes, # markers G2M phase
                           set.ident = TRUE) # change identity from cluster to cell cycle
# view cell cycle scores and phase assignments
head(cell_cycle[[]])
##        orig.ident nCount_RNA nFeature_RNA percent_mt percent_ERCC     S.Score
## RJUMSZ        XRA    1332675         1482   4.092296 0.0000000000 -0.05881237
## JUYFQD        XRA     169167          780   7.389739 0.0000000000 -0.03958919
## UONUAM        XRA     813984         1549   6.513027 0.0000000000 -0.05306389
## TSMWLG        XRA    1612853         1669  11.803370 0.0060761892 -0.01453037
## TFSDGD        XRA    1110079         1785   5.572126 0.0000000000 -0.06242548
## GBQWUU        XRA     915870         1476   4.183454 0.0002183716 -0.02143424
##          G2M.Score Phase old.ident
## RJUMSZ -0.01538705    G1       XRA
## JUYFQD  0.01164100   G2M       XRA
## UONUAM  0.03736741   G2M       XRA
## TSMWLG -0.06770569    G1       XRA
## TFSDGD -0.02714501    G1       XRA
## GBQWUU -0.06982177    G1       XRA
# convert Seurat Object to Data frame
df_XRA <- as.data.frame(cell_cycle[[]], genes = Seurat::VariableFeatures(cell_cycle[[]]), fix_names = TRUE)

Visualize the distribution of cell cycle markers

RidgePlot(cell_cycle, 
          features = c("NASP", "SLBP", "TUBB4B", "ANP32E"), 
          ncol = 2)
## Picking joint bandwidth of 0.101
## Picking joint bandwidth of 0.0807
## Picking joint bandwidth of 0.149
## Picking joint bandwidth of 0.0661

The Ridge Plot above shows the expression levels of 2 marker genes of S phase (NASP and SLBP) and 2 marker genes of G2M phase.

Then, the PCA on the cell cycle genes is performed.

cell_cycle <- RunPCA(cell_cycle, features = c(s_genes, g2m_genes))
## Warning in PrepDR5(object = object, features = features, layer = layer, : The
## following features were not available: MLF1IP, FAM64A, HN1.
## Warning in irlba(A = t(x = object), nv = npcs, ...): You're computing too large
## a percentage of total singular values, use a standard svd instead.
## PC_ 1 
## Positive:  ANP32E, CKS2, TUBB4B, NASP, SLBP, HMGB2, RPA2, PCNA, SMC4, RRM1 
##     CBX5, CTCF, MCM6, TMPO, GMNN, LBR, USP1, CASP8AP2, CKS1B, UNG 
##     RANGAP1, MCM5, TACC3, RFC2, NUSAP1, CKAP2, UBR7, NCAPD2, HELLS, MCM4 
## Negative:  TIPIN, CCNB2, UHRF1, TTK, RAD51, CDCA7, E2F8, ANLN, CDC25C, KIF23 
##     BRIP1, HJURP, KIF2C, CENPA, AURKB, POLA1, CLSPN, DTL, MSH2, EXO1 
##     GTSE1, GINS2, NEK2, NUF2, CDC45, CDCA2, DLGAP5, CDC20, PRIM1, PSRC1 
## PC_ 2 
## Positive:  HMGB2, SLBP, LBR, MCM6, CTCF, USP1, CKAP2, KIF20B, SMC4, NUSAP1 
##     ANP32E, TACC3, NDC80, ATAD2, PCNA, HELLS, TMPO, CKAP5, RANGAP1, CCNE2 
##     CENPE, WDR76, DLGAP5, GINS2, MKI67, CLSPN, TOP2A, HMMR, DTL, POLD3 
## Negative:  TUBB4B, CKS2, MCM5, RFC2, RRM1, UNG, RPA2, GAS2L3, CBX5, CKS1B 
##     UBR7, ECT2, G2E3, GMNN, TIPIN, NASP, CDC6, MSH2, POLA1, CHAF1B 
##     BIRC5, TYMS, MCM2, UBE2C, CDK1, CDC20, PSRC1, CENPA, KIF2C, CENPF 
## PC_ 3 
## Positive:  SLBP, CKS1B, RANGAP1, CKS2, HMGB2, CBX5, GMNN, NCAPD2, MCM5, RPA2 
##     MCM6, GAS2L3, TACC3, NUSAP1, CTCF, MCM4, CENPF, CHAF1B, CDCA3, MKI67 
##     BIRC5, HMMR, TPX2, CDCA2, BLM, CKAP2L, TOP2A, UBE2C, BUB1, ANP32E 
## Negative:  TMPO, LBR, USP1, UBR7, CKAP5, RRM1, SMC4, NASP, POLD3, ECT2 
##     UNG, MSH2, CKAP2, TUBB4B, PRIM1, RFC2, POLA1, KIF20B, G2E3, CASP8AP2 
##     TYMS, TIPIN, GINS2, DLGAP5, HELLS, WDR76, PCNA, CDC45, PSRC1, CCNE2 
## PC_ 4 
## Positive:  MCM5, RANGAP1, CTCF, POLD3, GAS2L3, LBR, TMPO, SMC4, CKS2, NCAPD2 
##     MCM6, CKAP5, TUBB4B, CBX5, CDK1, TYMS, TACC3, FEN1, CENPF, ECT2 
##     PSRC1, CKAP2, CENPE, BLM, RPA2, HELLS, BIRC5, TPX2, CKAP2L, NUSAP1 
## Negative:  GMNN, CKS1B, RRM1, UNG, PRIM1, RFC2, G2E3, UBR7, USP1, CASP8AP2 
##     SLBP, PCNA, ANP32E, TIPIN, POLA1, NDC80, ATAD2, MSH2, NASP, KIF20B 
##     HMGB2, MCM4, GTSE1, CLSPN, CDC20, HJURP, CDCA7, GINS2, CDC45, NUF2 
## PC_ 5 
## Positive:  CKS1B, PCNA, CTCF, CBX5, POLD3, G2E3, POLA1, SLBP, SMC4, RFC2 
##     UBR7, RPA2, CKAP5, NCAPD2, HELLS, MSH2, MCM5, CHAF1B, ATAD2, TMPO 
##     CENPE, DLGAP5, BUB1, FEN1, PSRC1, CDCA3, TOP2A, GAS2L3, UBE2C, TPX2 
## Negative:  CASP8AP2, HMGB2, UNG, RANGAP1, CKS2, TUBB4B, NASP, ANP32E, MCM6, BLM 
##     AURKA, GMNN, KIF20B, LBR, CDC45, CDK1, NDC80, TIPIN, RRM1, CCNE2 
##     TACC3, CLSPN, GINS2, HJURP, AURKB, GTSE1, USP1, NEK2, CDC6, CENPF
DimPlot(cell_cycle)

By looking at the PCA, it is possible to observe that cells in G1 are clustered together, while there is a overlap between cells in G2M phase and cells in S phase.

Regress out cell cycle scores during data scaling

The analysis aims to subtract (‘regress out’) this source of heterogeneity from the data. For each gene, Seurat models the relationship between gene expression and the S and G2M cell cycle scores. The scaled residuals of this model represent a ‘corrected’ expression matrix, that can be used downstream for dimensional reduction.

After this regression, a PCA on the variable genes no longer returns components associated with cell cycle. ALERT (long time step)

cell_cycle <- ScaleData(cell_cycle,
                        vars.to.regress = c("S.Score", "G2M.Score"), 
                        features = rownames(cell_cycle))
## Regressing out S.Score, G2M.Score
## Centering and scaling data matrix
cell_cycle <- RunPCA(cell_cycle, 
                     features = , 
                     nfeatures.print = 10)
## PC_ 1 
## Positive:  CD74, RNASE1, PECAM1, VWF, AQP1, THBD, ACKR1, TM4SF1, FABP5, ADAMTS9 
## Negative:  DCN, C1S, APOD, C3, ADH1B, COL6A3, CFD, GPX3, MFAP5, CILP 
## PC_ 2 
## Positive:  CKM, ACTA1, ENO3, DES, MYL1, ANKRD2, SLN, MYBPC1, MYOZ1, MB 
## Negative:  CD74, TM4SF1, VWF, PECAM1, A2M, AQP1, ADAMTS9, THBD, ACKR1, IL6 
## PC_ 3 
## Positive:  A2M, TM4SF1, ADAMTS1, VWF, ADAMTS9, ADAMTS4, AQP1, IL6, PECAM1, CCL2 
## Negative:  FCER1G, LYZ, IL1B, BCL2A1, CD163, S100A9, GPR183, CXCR4, MS4A7, C1QA 
## PC_ 4 
## Positive:  NOTCH3, ACTA2, RGS5, GJA4, HIGD1B, TAGLN, GJC1, CDH6, LMOD1, SSTR2 
## Negative:  ACKR1, SELE, PLVAP, CD74, PLAT, RAMP3, CSF3, ICAM1, PECAM1, VWF 
## PC_ 5 
## Positive:  CXCR4, CD69, GZMB, GNLY, IL7R, CTSW, PRF1, NKG7, MYF5, GZMH 
## Negative:  CD163, LYZ, IL1B, CXCL8, FCER1G, C1QA, F13A1, PLAUR, MS4A7, BCL2A1
DimPlot(cell_cycle)

By looking at the PCA, it is possible to observe that cells don’t separate by cell-cycle phase.

XRA <- cell_cycle
rm(s_genes,g2m_genes,cell_cycle)

Q3 - Dimensionality reduction

Dimensionality reduction techniques are employed to reduce data complexity in downstream analyses (e.g. clustering) and for data visualisation.

PCA

Principal component analysis (PCA) is the most popular dimension reduction method. For each cell, the original dimensions correspond to the expression values of each gene.

PCA performs an orthogonal transformation of the original dataset to create a set of new, uncorrelated variables or principal components. These principal components are linear combinations of variables in the original dataset. The transformation is defined such that the principal components are ranked with decreasing order of variance. Thus the first principal component amounts to the largest possible variance.

The idea is to discard the principal components with the lowest variance, and effectively reduce the dimensions of the dataset without much loss of information.

By default, in Seurat only the previously determined variable features are used as input for PCA, but this can be defined using features argument if you wish to choose a different subset.

XRA <- RunPCA(XRA, 
               features = VariableFeatures(object = XRA), # predetermined variable features
               verbose = F)

PCA is highly interpretable and computationally efficient, but it is inappropriate for scRNA-seq data visualization.

scRNA-seq data is sparse due to dropout events (weakly expressed genes are missed), meaning there are 60–80% zeroes in the data matrix. It is a highly non-linear structure, while PCA is a linear dimensional reduction technique and therefore deemed very inappropriate for data visualisation. PCA is only used to select ~top 10–50 principal components that can be processed with downstream applications like cluster analysis.

Other dimensionality reduction techniques are used for visualization (t-SNE, UMAP).

Seurat provides several ways of visualizing both cells and features that define the PCA, including VizDimReduction, DimPlot, and DimHeatmap. PCA results are stored in XRA[["pca"]].

VizDimLoadings(XRA, dims = 1:2, reduction = "pca")

DimPlot(XRA, reduction = "pca")

DimHeatmap allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses. Both cells and features are ordered according to their PCA scores. Setting “cells” to a number plots the ‘extreme’ cells on both ends of the spectrum, which dramatically speeds plotting for large datasets.

DimHeatmap(XRA, 
           dims = 1:2, #number of dimensions
           cells = 500, 
           balanced = TRUE)

Determine the ‘dimensionality’ of the dataset

To overcome the extensive technical noise in any single feature for scRNA-seq data, Seurat clusters cells based on their PCA scores, with each PC essentially representing a ‘metafeature’ that combines information across a correlated feature set. The top principal components therefore represent a robust compression of the dataset. However, it is needed to decide how many components should be included.

So, a heuristic method is used to decide the number of PC to consider. This method generates an ‘Elbow plot’: a ranking of principle components based on the percentage of variance explained by each one (ElbowPlot function).

ElbowPlot(XRA)

In this case, the elbow can be observed around PC 6. This number will be used as input for the clustering step.

Q4 - Clustering

Cluster cells

Seurat applies a graph-based clustering approach. The distance metric which drives the clustering analysis is based on previously identified PCs. The approach to partitioning the cellular distance matrix into clusters is the following: cells are embedded in a graph structure - for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar feature expression patterns, and then there is an attempt to partition this graph into highly interconnected ‘quasi-cliques’ or ‘communities’. Seurat first constructs a KNN graph based on the euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity). This step is performed using the FindNeighbors function, and takes as input the previously defined dimensionality of the dataset.

Then, to cluster the cells, modularity optimization techniques (such as the Louvain algorithm (default)) are applied to iteratively group cells together, with the goal of optimizing the standard modularity function. The FindClusters function implements this procedure, and contains a resolution parameter that sets the ‘granularity’ of the downstream clustering, with increased values leading to a greater number of clusters.

The clusters can be found using the Idents function.

XRA <- FindNeighbors(XRA, reduction = "pca", dims = 1:6)
## Computing nearest neighbor graph
## Computing SNN
XRA <- FindClusters(XRA, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 3738
## Number of edges: 113358
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9100
## Number of communities: 9
## Elapsed time: 0 seconds
head(Idents(XRA), 5)
## RJUMSZ JUYFQD UONUAM TSMWLG TFSDGD 
##      3      4      3      3      3 
## Levels: 0 1 2 3 4 5 6 7 8

Visualize clusters with PCA

DimPlot(XRA, reduction = "pca", label = T)

Non-linear dimensional reduction for visualization (UMAP/tSNE)

Seurat offers several non-linear dimensional reduction techniques, such as tSNE and UMAP, to visualize and explore these datasets. Cells within the graph-based clusters determined above should co-localize on these dimension reduction plots. The same PCs as input to the clustering analysis are used as input to the UMAP and tSNE.

tSNE (t-Stochastic Neighbourhood Embedding)

t-SNE is a graph based, non-linear dimensionality reduction technique. It projects high dimensional data onto 2D or 3D components. It works by calculating pairwise similarities between data points, emphasizing close similarities and using Gaussian distributions to define relationships. Then, it maps the data to a lower-dimensional space (typically two dimensions) while preserving these pairwise similarities. t-SNE employs an iterative optimization process to position data points in the lower-dimensional map, ensuring that similar data points remain close, while dissimilar ones are separated. This makes it a valuable tool for exploring and visualizing intricate data structures and identifying clusters or patterns within the data.

Pros

t-SNE powerfully captures the non-linearity in high dimensional datasets and is able to retain the local structures in low dimensions. This is a huge improvement over PCA. t-SNE has been used as a gold standard method for scRNA-seq data visualisation.

Cons

The way t-SNE works, it is impossible for it to preserve the global structure while performing dimension reduction. Only local structures are preserved, while the distances between groups are drastically different depending on the run.

XRA <- RunTSNE(XRA, dims = 1:6)
DimPlot(XRA, reduction = "tsne",label=T)

UMAP (Uniform Manifold Approximation and Projection)

UMAP is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction. UMAP is a relatively new dimensional reduction technique introduced by McInnes et al in 2018. (McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426, 2018)

The algorithm is graph based and principally similar to t-SNE where it constructs a high dimensional graph representation of the data, then optimizes a low-dimensional graph to be as structurally similar as possible.

Pros

  • Non linear datasets: UMAP is manifold learning dimension reduction technique and thus captures the non linearity of real world datasets. It is comparable to t-SNE in terms of data visualisation.

  • The mathematical improvements in UMAP allow superior run time performance over t-SNE

  • In comparison to t-SNE, UMAP offers better preservation of a data’s global structure.

  • Unlike t-SNE, UMAP has no computational restrictions on embedding dimensions and can be used as an effective pre-processing step to boost the performance of density based clustering algorithms.

Cons

  • Lacks interpretability: Unlike PCA, where the principal components are directions of greatest variance of the source data, the lower dimension embeddings of UMAP lack strong interpretability.

  • One of the core assumptions of UMAP is that there exists manifold structure in the data. Because of this, UMAP can tend to find manifold structure within the noise of a dataset.

XRA <- RunUMAP(XRA, dims = 1:6)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 19:56:18 UMAP embedding parameters a = 0.9922 b = 1.112
## 19:56:18 Read 3738 rows and found 6 numeric columns
## 19:56:18 Using Annoy for neighbor search, n_neighbors = 30
## 19:56:18 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 19:56:18 Writing NN index file to temp file /var/folders/ry/9rv81k4s20xdgj8v_km36dxm0000gn/T//RtmpQzTPIE/fileb9c83ed2747e
## 19:56:18 Searching Annoy index using 1 thread, search_k = 3000
## 19:56:19 Annoy recall = 100%
## 19:56:20 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 19:56:21 Initializing from normalized Laplacian + noise (using RSpectra)
## 19:56:21 Commencing optimization for 500 epochs, with 147076 positive edges
## 19:56:26 Optimization finished
DimPlot(XRA, reduction = "umap",label=T)

Q5 - Identification of marker genes

Seurat can find markers that define clusters via differential expression. By default, setting only ident.1, it identifes positive and negative markers of a single cluster (specified in ident.1), compared to all other cells. FindAllMarkers automates this process for all clusters, but you can also test groups of clusters vs. each other, or against all cells.

The min.pct argument requires a feature to be detected at a minimum percentage in either of the two groups of cells, and the thresh.test argument requires a feature to be differentially expressed (on average) by some amount between the two groups.

The default test used is the Wilcoxon Rank Sum test.

With the following chunks, the aim is to find markers for every clusters compared to all remaining cells. Only the positive ones are reported in a table.

XRA_markers <- FindAllMarkers(XRA, 
                               only.pos = TRUE, # only positive markers
                               min.pct = 0.25, # test genes that are detected in minimum 25% 
                               # of cells in either of the two population
                               logfc.threshold = 0.25) # threshold for log fold-change
## Calculating cluster 0
## For a (much!) faster implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the presto package
## --------------------------------------------
## install.packages('devtools')
## devtools::install_github('immunogenomics/presto')
## --------------------------------------------
## After installation of presto, Seurat will automatically use the more 
## efficient implementation (no further action necessary).
## This message will be shown once per session
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
head(XRA_markers)
##        p_val avg_log2FC pct.1 pct.2 p_val_adj cluster   gene
## C3         0   4.116145 0.954 0.080         0       0     C3
## DCN        0   3.718785 0.999 0.126         0       0    DCN
## CILP       0   4.461804 0.921 0.053         0       0   CILP
## COL6A3     0   4.358773 0.957 0.102         0       0 COL6A3
## C1S        0   3.949805 0.989 0.134         0       0    C1S
## CFD        0   3.755910 0.961 0.110         0       0    CFD

Some of the columns present in the obtained data frame are:

XRA_positive_markers <- XRA_markers %>%
  group_by(cluster) %>%
  slice_max(n=2,order_by=avg_log2FC)
XRA_positive_markers
## # A tibble: 18 × 7
## # Groups:   cluster [9]
##        p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene      
##        <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>     
##  1 1.12e-137       5.10 0.25  0.011 4.80e-133 0       WNT5A     
##  2 0               4.77 0.747 0.06  0         0       ABCA9     
##  3 9.00e-187       9.20 0.278 0.002 3.85e-182 1       PRF1      
##  4 0               8.58 0.509 0.006 0         1       CD2       
##  5 0               6.15 0.687 0.012 0         2       SELP      
##  6 0               6.03 0.576 0.029 0         2       VCAM1     
##  7 3.13e-254       5.44 0.436 0.015 1.34e-249 3       BTNL9     
##  8 6.66e- 86       4.90 0.319 0.051 2.85e- 81 3       CR381670.1
##  9 0              11.6  0.552 0.001 0         4       MYF5      
## 10 4.43e-214       8.51 0.312 0.003 1.89e-209 4       DLK1      
## 11 0              12.8  0.623 0.004 0         5       MS4A7     
## 12 0              11.4  0.84  0.004 0         5       LYZ       
## 13 5.99e- 10       3.29 0.473 0.34  2.56e-  5 6       MT1XP1    
## 14 2.98e-131       3.01 0.905 0.279 1.27e-126 6       APOD      
## 15 7.16e-274      11.0  0.387 0.003 3.06e-269 7       CDH6      
## 16 4.10e-239      10.9  0.327 0.001 1.75e-234 7       AVPR1A    
## 17 0              13.5  0.504 0.001 0         8       HSPB3     
## 18 2.11e-249      12.6  0.372 0.002 9.03e-245 8       MYH7

Visualize marker expression

Seurat offers several tools for visualizing marker expression.

VlnPlot shows expression probability distributions across clusters

In the violin plots, it is possible to specify which features we want to visualize. In the x-axis is reported the identity (the cluster), while on the y-axis, the expression level.

VlnPlot(XRA, features = c("APOD", "COL6A3"), pt.size = 0)

RidgePlot also shows expression probability distributions.

This function allows to see the expression levels of a particular gene in all the clusters. So, it is possible to use this visualization to identify marker specific for one cluster or another.

RidgePlot(XRA, features = c("APOD", "COL6A3"))

# save plot as pdf
pdf("img/15_RidgePlot.pdf")
RidgePlot(XRA, features = c("APOD", "COL6A3"))
dev.off()
## quartz_off_screen 
##                 2

FeaturePlot visualizes feature expression on a tSNE, UMAP or PCA plot

This plot allows to see the expression of a particular feature in UMAP (so, in the dimensionality reduction space).

FeaturePlot(XRA, features = c("APOD", "COL6A3"), order =T )

Each cell is colored according to the expression of that particular feature.

DotPlot

Intuitive way of visualizing how feature expression changes across different identity classes (clusters). The size of the dot encodes the percentage of cells within a class, while the color encodes the Average Expression level of cells within a class (blue is high).

DotPlot(XRA, 
        features = c("WNT5A","ABCA9","PRF1","CD2","SELP","VCAM1","BTNL9","CR381670.1","MYF5","DLK1","MS4A7","LYZ","MT1XP1", "APOD", "CDH6", "AVPR1A","HSPB3", "MYH7") # features we want to observe
        ) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

DoHeatmap generates an expression heatmap for given cells and features

In this case, the top 3 markers (or all markers if less than 3) for each cluster are plotted.

top3 <- XRA_markers %>% 
  group_by(cluster) %>% 
  top_n(n = 3, # top 3 for each cluster
        wt = avg_log2FC)

DoHeatmap(XRA, features = top3$gene) + NoLegend()

Interactive plotting features

Seurat utilizes R’s plotly graphing library to create interactive plots.

plot <- FeaturePlot(XRA, features = "APOD")
HoverLocator(plot = plot, information = FetchData(XRA, vars = c("ident","percent_mt", "nFeature_RNA")))

Q7- Annotation

The final step of the analysis consists on trying to annotate the cell with annotated dataset. To do that, the package SingleR will be used.

SingleR (Aran et al. 2019)

SingleR can be considered a robust variant of nearest-neighbors classification, with some tweaks to improve resolution for closely related labels.

To work, it needs:

  • Reference set: A comprehensive transcriptomic dataset (single-cell or bulk) of pure cell types, preferably with multiple samples per cell type. In this case, will be loaded a reference set for human and one for mice.

  • Single-cell set: Single-cell RNA-seq dataset with unknown cell types. In this case, XRA will be the unknown dataset.

SingleR runs in two modes:

  • Single-cell: the annotation is performed for each single-cell independently.

  • Cluster: the annotation is performed on predefined clusters, where the expression of a cluster is the sum expression of all cells in the cluster. Faster and computationally efficient, but the results are not so reliable to respect to the single-cell approach.

The approach can be divided into 3 main steps:

  1. Spearman coefficient is calculated for single-cell expression with each of the samples in the reference dataset. The correlation analysis is performed only on variable genes in the reference dataset.

  2. Multiple correlation coefficients per cell types are aggregated to provide a single value per cell type per single-cell.

  3. In this step SingleR reruns the correlation analysis, but only for the top cell types from step 2. The analysis is performed only on variable genes between these cell types. The lowest value cell type is removed, and then this step is repeated until only two cell types remain. The cell type corresponding to the top value after the last run is assigned to the single-cell.

In the XRA dataset, all gene symbols have letters in upper case, meaning that they are from human. So, in the following analysis is reported the annotation with human reference.

# load reference human
ref.human <- celldex::HumanPrimaryCellAtlasData()
ref.human
## class: SummarizedExperiment 
## dim: 19363 713 
## metadata(0):
## assays(1): logcounts
## rownames(19363): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(713): GSM112490 GSM112491 ... GSM92233 GSM92234
## colData names(3): label.main label.fine label.ont
#ref.human$label.main[1:50]

To continue the analysis, it is necessary to convert the XRA dataset into a SingleCellExperiment object.

XRA_exp <- as.SingleCellExperiment(DietSeurat(XRA))
pred_human <- SingleR(test=XRA_exp,
                      clusters = XRA@meta.data$seurat_clusters, 
                      # vector of cluster identities for each cell
                      ref=ref.human, 
                      labels=ref.human$label.main, 
                      fine.tune = TRUE, prune = TRUE)
pred_human_df <- data.frame(cluster=rownames(pred_human),
                             labels=pred_human$labels)

pred_human_df
##   cluster            labels
## 1       0 Tissue_stem_cells
## 2       1           T_cells
## 3       2 Endothelial_cells
## 4       3 Endothelial_cells
## 5       4 Tissue_stem_cells
## 6       5          Monocyte
## 7       6 Tissue_stem_cells
## 8       7 Tissue_stem_cells
## 9       8 Tissue_stem_cells

Once assigned a label, plotScoreHeatmap can be used to visualize the results. plotScoreHeatmap() displays the correlation-based scores for all clusters across all reference labels. Each cluster is a column while each row is a label in the reference dataset.

heatmap1 <- plotScoreHeatmap(pred_human)

Then, we add the annotation obtained to the original Seurat object to later plot the clusters with the labels.

XRA[["SingleR_human_labels"]] <- pred_human_df$labels[match(XRA@meta.data$seurat_clusters, pred_human_df$cluster)]

DimPlot(XRA, group.by = "SingleR_human_labels")

ScType

ScType accepts both positive and negative markers, i.e., gene that are not expected to be expressed in a particular cell type. Sctype provides its own marker database for human and mouse, obtained from the integration of the information available in the CellMarker database and PanglaoDB.

scType cell_type annotation:

  1. For each positive/negative marker compute specificity score, which indicate whether a gene is a marker for a specific cell types.

  2. The raw expression matrix is normalized and Z-transform (scale the expression of each gene across cells)

  3. The transformed matrix is multiply by the cell-type specificity score

  4. For each cell types the expression scores of all its positive markers are summarized into a single enrichment score by summing them and dividing by square root of their number. The same is done for the negative markers.

  5. The negative marker expression score is subtracted from the positive score to obtain the final enrichment score. Individual cells are assigned to a cell type based on the maximum value for the cell type marker set.

Prepare marker genes to use for the annotation

By default, scType use the in-built marker DB.

# DB file
db_ <- "https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/ScTypeDB_full.xlsx";
tissue <- c("Brain","Immune system","Pancreas","Liver","Eye","Kidney","Brain","Lung","Adrenal","Heart","Intestine","Muscle","Placenta","Spleen","Stomach","Thymus")

# load gene set preparation function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/gene_sets_prepare.R")
# prepare gene sets
gs_list <- gene_sets_prepare(db_, tissue)

The function sctype_score() performs the scoring process. It takes in input the scaling data (already preprocessed), even if we can give in input the raw counts by specifying scaled = FALSE.

# load cell type annotation function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/sctype_score_.R")

es.max <- sctype_score(scRNAseqData = XRA[["RNA"]]$scale.data, 
                       scaled = TRUE, 
                       # specify the list of markers to use with gs and gs2
                       gs = gs_list$gs_positive, # positive markers
                       gs2 = gs_list$gs_negative) # negative markers
                       # in case there are no negative markers just set gs2 = NULL

# merge by cluster
cL_results <- do.call("rbind",
                      lapply(unique(XRA@meta.data$seurat_clusters),
                             function(cl){
                               es.max.cl = sort(rowSums(es.max[ ,rownames(XRA@meta.data[XRA@meta.data$seurat_clusters==cl, ])]), decreasing = !0)
                               head(data.frame(cluster = cl, 
                                               type = names(es.max.cl), 
                                               scores = es.max.cl, 
                                               ncells = sum(XRA@meta.data$seurat_clusters==cl)), 10)
                               }))

sctype_scores <- cL_results %>% group_by(cluster) %>% top_n(n = 1, wt = scores)  

Assign cell types to each cluster

The annotation obtained is then added to the meta data by creating another column called scType_labels. Then, the results are visualized.

# add annotation to meta data
XRA[["scType_labels"]] <- sctype_scores$type[match(XRA@meta.data$seurat_clusters, sctype_scores$cluster)]

# visualize the results
DimPlot(XRA, reduction = "umap", label = TRUE, repel = TRUE, group.by = 'scType_labels')     

Comparison between ScType and SingleR results

As it can be observed from the plots reported below, the ScType annotation results to be more precise and able to assign a label to each cluster.

DimPlot(XRA, group.by = "SingleR_human_labels")

DimPlot(XRA, reduction = "umap", label = TRUE, repel = TRUE, group.by = 'scType_labels') 

Automatically detect a tissue type of the dataset

ScType provides a functionality for automated guessing of a tissue type given in input a unknown dataset.

# load auto-detection function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/auto_detect_tissue_type.R")
db_ <- "https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/ScTypeDB_full.xlsx";

# guess a tissue type
tissue_guess <- auto_detect_tissue_type(path_to_db_file = db_, 
                                        seuratObject = XRA, 
                                        scaled = TRUE, 
                                        assay = "RNA")
# if scaled = TRUE, make sure the data is scaled, as seuratObject[[assay]]@scale.data is used. If you just created a Seurat object, without any scaling and normalization, set scaled = FALSE, seuratObject[[assay]]@counts will be used         

However, this chunck generates the following error and it wasn’t possible to infer the tissue type of the dataset.

Error in Z[cell_markers_genes_score[jj, "gene_"], ] : subscript out of bounds

To overcome this error, an attempt with the following chunck was made, implementing a for loop aiming to get tissue type of the dataset.

# DB file
db_ <- "https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/ScTypeDB_full.xlsx"
#db_ <- "ScTypeDB_full.xlsx"

tissue <- c("Brain","Immune system","Pancreas","Liver","Eye","Kidney","Brain","Lung","Adrenal","Heart","Intestine","Muscle","Placenta","Spleen","Stomach","Thymus")
tissue
##  [1] "Brain"         "Immune system" "Pancreas"      "Liver"        
##  [5] "Eye"           "Kidney"        "Brain"         "Lung"         
##  [9] "Adrenal"       "Heart"         "Intestine"     "Muscle"       
## [13] "Placenta"      "Spleen"        "Stomach"       "Thymus"
# Load gene set preparation function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/gene_sets_prepare.R")

# Load cell type annotation function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/sctype_score_.R")

# Initialize a list to store sctype_scores for each tissue
sctype_scores_list <- list()
score_sum_list <- list()  # Initialize a list to store sum of scores for each tissue

# Loop over each tissue
for (t in tissue) {
  cat("Now analyzing for tissue", t, "...\n")  # Print current tissue being analyzed
  # Prepare gene sets for the current tissue
  gs_list <- gene_sets_prepare(db_, t)
  
  # Calculate ScType scores for the current tissue
  es.max <- sctype_score(scRNAseqData = XRA[["RNA"]]$scale.data, 
                         scaled = TRUE, 
                         gs = gs_list$gs_positive, 
                         gs2 = NULL) #gs_list$gs_negative
  
  # Merge by cluster
  cL_results <- do.call("rbind", 
                        lapply(unique(XRA@meta.data$seurat_clusters), 
                               function(cl){es.max.cl = sort(rowSums(es.max[,rownames(XRA@meta.data[XRA@meta.data$seurat_clusters==cl, ])]), decreasing = TRUE)
                               head(data.frame(cluster = cl, 
                                               type = names(es.max.cl), 
                                               scores = es.max.cl, 
                                               ncells = sum(XRA@meta.data$seurat_clusters==cl)), 10)
  }))
  
  sctype_scores <- cL_results %>% group_by(cluster) %>% top_n(n = 1, wt = scores)  
  
  # Set low-confident (low ScType score) clusters to "unknown"
  sctype_scores$type[as.numeric(as.character(sctype_scores$scores)) < sctype_scores$ncells/4] <- "Unknown"
  
  # Store sctype_scores in the list
  sctype_scores_list[[t]] <- sctype_scores[, 1:3]
  
  # Calculate sum of scores for the current tissue and store it in the score_sum_list
  score_sum_list[[t]] <- sum(sctype_scores$scores)
  #XRA[["scType_labels"]] <- sctype_scores$type[match(XRA@meta.data$seurat_clusters, sctype_scores$cluster)]
  #DimPlot(XRA, reduction = "umap", label = TRUE, repel = TRUE, group.by = 'scType_labels') 
}
## Now analyzing for tissue Brain ...
## Now analyzing for tissue Immune system ...
## Warning in checkGeneSymbols(markers_all): Human gene symbols should be all
## upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Now analyzing for tissue Pancreas ...
## Warning in checkGeneSymbols(markers_all): Human gene symbols should be all
## upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Now analyzing for tissue Liver ...
## Warning in checkGeneSymbols(markers_all): Human gene symbols should be all
## upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): Human gene symbols should be all
## upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Now analyzing for tissue Eye ...
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Now analyzing for tissue Kidney ...
## Now analyzing for tissue Brain ...
## Now analyzing for tissue Lung ...
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Now analyzing for tissue Adrenal ...
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): Human gene symbols should be all
## upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): Human gene symbols should be all
## upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): Human gene symbols should be all
## upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Now analyzing for tissue Heart ...
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): Human gene symbols should be all
## upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): Human gene symbols should be all
## upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): Human gene symbols should be all
## upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Now analyzing for tissue Intestine ...
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): Human gene symbols should be all
## upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): Human gene symbols should be all
## upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): Human gene symbols should be all
## upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Now analyzing for tissue Muscle ...
## Warning in checkGeneSymbols(markers_all): Human gene symbols should be all
## upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): Human gene symbols should be all
## upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): Human gene symbols should be all
## upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): Human gene symbols should be all
## upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): Human gene symbols should be all
## upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Now analyzing for tissue Placenta ...
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): Human gene symbols should be all
## upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): Human gene symbols should be all
## upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): Human gene symbols should be all
## upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Now analyzing for tissue Spleen ...
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Now analyzing for tissue Stomach ...
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): Human gene symbols should be all
## upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): Human gene symbols should be all
## upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Now analyzing for tissue Thymus ...
## Warning in checkGeneSymbols(markers_all): Human gene symbols should be all
## upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): Human gene symbols should be all
## upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols
# Combine tissue names and corresponding score sums into a data frame
score_sum_df <- data.frame(tissue = names(score_sum_list), score_sum = unlist(score_sum_list))
score_sum_df<- score_sum_df %>% 
  arrange(score_sum)
score_sum_df
##                      tissue score_sum
## Spleen               Spleen  1756.479
## Placenta           Placenta  4307.872
## Muscle               Muscle  4557.946
## Adrenal             Adrenal  5200.845
## Brain                 Brain  5217.443
## Intestine         Intestine  5586.274
## Thymus               Thymus  5653.056
## Pancreas           Pancreas  5778.970
## Heart                 Heart  5792.748
## Eye                     Eye  5839.415
## Kidney               Kidney  6019.799
## Immune system Immune system  6109.616
## Stomach             Stomach  6522.849
## Lung                   Lung  6601.815
## Liver                 Liver  7201.311
# Create a bar plot using ggplot2 with the ordered dataframe
ggplot(score_sum_df, aes(x = reorder(tissue, score_sum), y = score_sum, fill = tissue)) +
  geom_bar(stat = "identity", color = "black",show.legend = FALSE) +
  labs(title = "Sum of ScType Scores for Each Tissue",
       x = "Tissue", y = "Sum of Scores") +
  theme_minimal() +
  coord_flip()  # Flip the plot

The tissue type of the dataset is the one with the highest ScType score. In this case, from the plot, it is possible to observe that the tissue with higher score is Liver.